Place recognition plays a crucial role in the relocalization and loop closure detection tasks of robots and vehicles. This paper seeks a well-defined global descriptor for LiDAR-based place recognition. Compared with local descriptors, global descriptors show remarkable performance in urban road scenes but are usually viewpoint-dependent. To this end, we propose a simple yet robust global descriptor, dubbed FreSCo, that leverages the Fourier transform and circular shift techniques to decompose the viewpoint difference at revisit time and achieve invariance to both translation and rotation. In addition, a fast two-stage pose estimation method is proposed that uses the compact 2D point cloud extracted from the scene to estimate the relative pose after place retrieval. Experiments show that FreSCo outperforms contemporaneous methods on sequences of diverse scenes from multiple datasets. The code will be made publicly available at https://github.com/soytony/fresco.
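The two invariances rest on a classical frequency-domain fact: if a scan is rendered as a polar bird's-eye-view image, a sensor rotation becomes a circular shift along the angular axis, which changes only the phase (not the magnitude) of the discrete Fourier transform. A minimal numpy sketch of that idea, with a toy polar image standing in for the paper's actual descriptor pipeline:

```python
# Minimal sketch of the frequency-domain idea behind FreSCo-style descriptors.
# Assumptions: a polar BEV representation where sensor rotation appears as a
# circular column shift; the exact pipeline details differ from the paper.
import numpy as np

def shift_invariant_descriptor(polar_img: np.ndarray) -> np.ndarray:
    """DFT magnitude along the angular axis is invariant to circular shifts,
    i.e. to sensor rotation at revisit time."""
    return np.abs(np.fft.fft(polar_img, axis=1))

def estimate_rotation(query: np.ndarray, candidate: np.ndarray) -> int:
    """Recover the circular column shift (rotation) between two polar images
    via circular cross-correlation, computed in the frequency domain."""
    corr = np.fft.ifft(np.fft.fft(query, axis=1) *
                       np.conj(np.fft.fft(candidate, axis=1)), axis=1).real
    return int(np.argmax(corr.sum(axis=0)))  # best column offset

rng = np.random.default_rng(0)
scan = rng.random((40, 120))                # range rings x angular bins
revisit = np.roll(scan, 17, axis=1)         # same place, rotated sensor
assert np.allclose(shift_invariant_descriptor(scan),
                   shift_invariant_descriptor(revisit))
print("recovered shift:", estimate_rotation(revisit, scan))  # -> 17
```

The same frequency-domain product used for matching also yields the relative rotation in a single pass, which is what makes this family of descriptors cheap to compare at scale.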
In time series forecasting, decomposition-based algorithms break aggregate data into meaningful components and are therefore appreciated for their particular advantages in interpretability. Recent algorithms often combine machine learning (hereafter ML) methodology with decomposition to improve prediction accuracy. However, incorporating ML is generally believed to inevitably sacrifice interpretability. In addition, existing hybrid algorithms usually rely on theoretical models with statistical assumptions and focus only on the accuracy of aggregate predictions, and therefore suffer from accuracy problems, especially in component estimates. In response to these issues, this research explores the possibility of improving accuracy without losing interpretability in time series forecasting. We first quantitatively define interpretability for data-driven forecasts and systematically review existing forecasting algorithms from the perspective of interpretability. Accordingly, we propose the W-R algorithm, a hybrid algorithm that combines decomposition and ML from a novel perspective. Specifically, the W-R algorithm replaces the standard additive combination function with a weighted variant and uses ML to modify the estimates of all components simultaneously. We mathematically analyze the theoretical basis of the algorithm and validate its performance through extensive numerical experiments. In general, the W-R algorithm outperforms all decomposition-based and ML benchmarks. Measured by P50_QL, the algorithm achieves relative accuracy improvements of 8.76% on the practical sales forecasts of JD.com and 77.99% on a public dataset of electricity loads. This research offers an innovative perspective on combining statistical and ML algorithms, and JD.com has implemented the W-R algorithm to make accurate sales predictions and guide its marketing activities.
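The core replacement is easy to state: instead of recomposing a forecast as the plain sum of its components, fit weights on the components against observed data. A minimal sketch under stated assumptions: a moving-average decomposition and an ordinary least-squares fit stand in for the paper's actual components and ML weight model, and the residual is left out of the recomposition since it is unknown at forecast time:

```python
# Sketch of the weighted-recomposition idea behind the W-R algorithm.
# Assumptions: toy synthetic data, a classical moving-average decomposition,
# and least squares in place of the paper's ML weight model.
import numpy as np

def decompose(y: np.ndarray, period: int):
    """Classical additive decomposition via a centered moving average."""
    trend = np.convolve(y, np.ones(period) / period, mode="same")
    detrended = y - trend
    seasonal = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal = np.tile(seasonal, len(y) // period + 1)[: len(y)]
    return trend, seasonal

rng = np.random.default_rng(1)
t = np.arange(240)
y = 0.05 * t + 2.0 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.3, t.size)

# Recompose from trend and seasonal estimates only; the residual component
# is unknown at forecast time, so it is excluded from the model.
X = np.column_stack(decompose(y, period=12))
w, *_ = np.linalg.lstsq(X, y, rcond=None)        # learned component weights
print("weights:", np.round(w, 3))                 # near [1, 1] on this toy
print("MSE, additive (w = [1, 1]):", np.mean((X.sum(axis=1) - y) ** 2))
print("MSE, weighted:             ", np.mean((X @ w - y) ** 2))
```

By construction the unit-weight additive combination is a special case of the weighted fit, so the weighted recomposition can only match or reduce the in-sample error, while each weighted component remains individually inspectable.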
Spatio-Temporal Video Grounding (STVG) focuses on retrieving the spatio-temporal tube of a specific object depicted by a free-form text expression. Existing approaches mainly treat this complex task as a parallel frame-grounding problem, and thus suffer from two types of inconsistency: inconsistent feature alignment and inconsistent prediction. In this paper, we present an end-to-end one-stage framework, termed the Spatio-Temporal Consistency-Aware Transformer (STCAT), to alleviate these issues. In particular, we introduce a novel multi-modal template as the global objective for this task, which explicitly constrains the grounding region and links the predictions across all video frames. Moreover, to generate the above template under sufficient video-text perception, an encoder-decoder architecture is proposed for effective global context modeling. Thanks to these critical designs, STCAT enjoys more consistent cross-modal feature alignment and tube prediction without relying on any pre-trained object detectors. Extensive experiments show that our method outperforms the previous state of the art on two challenging video benchmarks (VidSTG and HC-STVG), illustrating the superiority of the proposed framework for better understanding the association between vision and natural language. The code is publicly available at \url{https://github.com/jy0205/STCAT}.
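The consistency argument hinges on one shared object: a single template, produced from the jointly encoded video and text tokens, conditions every frame's box prediction, so per-frame outputs cannot drift independently. A minimal PyTorch sketch of that shape-level idea; the toy dimensions, single template query, and stock transformer layers are assumptions, not the paper's actual architecture:

```python
# Sketch of grounding all frames through one global template.
# Assumptions: toy dimensions and plain PyTorch transformer layers stand in
# for STCAT's actual encoder-decoder design.
import torch
import torch.nn as nn

class TemplateGrounder(nn.Module):
    def __init__(self, d=256, n_frames=16):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), 2)
        self.template = nn.Parameter(torch.zeros(1, 1, d))  # global query
        self.cross_attn = nn.MultiheadAttention(d, 8, batch_first=True)
        self.box_head = nn.Linear(d, 4)   # one box per frame
        self.n_frames = n_frames

    def forward(self, video_tokens, text_tokens):
        # Jointly encode video and text tokens for global context.
        ctx = self.encoder(torch.cat([video_tokens, text_tokens], dim=1))
        # One shared template attends over the whole sequence, so every
        # frame's prediction is conditioned on the same global target.
        tmpl, _ = self.cross_attn(self.template.expand(ctx.size(0), -1, -1),
                                  ctx, ctx)
        b, d = ctx.size(0), ctx.size(-1)
        frame_feats = video_tokens.view(b, self.n_frames, -1, d).mean(2)
        return self.box_head(frame_feats + tmpl)  # (B, T, 4) boxes

model = TemplateGrounder()
video = torch.randn(2, 16 * 4, 256)   # 16 frames x 4 patch tokens each
text = torch.randn(2, 8, 256)         # 8 word tokens
print(model(video, text).shape)       # torch.Size([2, 16, 4])
```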
Large curated datasets are necessary, but annotating medical images is a time-consuming, laborious, and expensive process. Therefore, recent semi-supervised methods focus on exploiting large amounts of unlabeled data. However, doing so is a challenging task. To address this issue, we propose a new 3D Cross Pseudo Supervision (3D-CPS) method, a semi-supervised network architecture based on nnU-Net that adopts the cross pseudo supervision strategy. We design a new nnU-Net-based preprocessing method and adopt a forced-spacing setting strategy in the inference stage to speed up inference. In addition, we set the semi-supervised loss weight to grow linearly with each epoch, to prevent the model from being misled by low-quality pseudo labels early in training. Our proposed method achieves an average Dice similarity coefficient (DSC) of 0.881 and an average normalized surface distance (NSD) of 0.913 on the MICCAI FLARE2022 validation set (20 cases).
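Cross pseudo supervision itself is simple: two networks see the same unlabeled batch and each is trained on the other's hard pseudo labels, with the semi-supervised term weighted up over training. A minimal sketch under stated assumptions: 2D toy tensors and a generic linear schedule stand in for the paper's 3D nnU-Net setting:

```python
# Sketch of cross pseudo supervision with a linearly growing loss weight.
# Assumptions: 2D toy logits and a generic w_max; the paper operates on 3D
# volumes with nnU-Net backbones.
import torch
import torch.nn.functional as F

def cps_loss(logits_a, logits_b):
    """Each network is supervised by the other's hard pseudo labels."""
    pseudo_a = logits_a.argmax(dim=1).detach()
    pseudo_b = logits_b.argmax(dim=1).detach()
    return (F.cross_entropy(logits_a, pseudo_b) +
            F.cross_entropy(logits_b, pseudo_a))

def semi_weight(epoch, max_epochs, w_max=1.0):
    """Linear ramp-up: trust pseudo labels more as training progresses."""
    return w_max * epoch / max_epochs

logits_a = torch.randn(2, 4, 32, 32)   # two networks' predictions
logits_b = torch.randn(2, 4, 32, 32)   # on the same unlabeled batch
for epoch in (0, 50, 100):
    loss = semi_weight(epoch, 100) * cps_loss(logits_a, logits_b)
    print(f"epoch {epoch:3d}: weighted CPS loss = {loss.item():.3f}")
```

The ramp-up means early epochs, when both networks produce noisy pseudo labels, contribute almost nothing to the semi-supervised term.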
Designing private voting rules is an important problem for trustworthy democracy. In this paper, under the framework of differential privacy, we propose three classes of randomized voting rules based on the well-known Condorcet method: the Laplacian Condorcet method ($CM^{LAP}_\lambda$), the exponential Condorcet method ($CM^{EXP}_\lambda$), and the randomized response Condorcet method ($CM^{RR}_\lambda$), where $\lambda$ represents the level of noise. By accurately estimating the errors introduced by the randomness, we show that $CM^{EXP}_\lambda$ is the most accurate mechanism in most cases. We prove that all of our rules satisfy absolute monotonicity, lexi-participation, probabilistic Pareto efficiency, the approximate probabilistic Condorcet criterion, and approximate SD-strategyproofness. In addition, $CM^{RR}_\lambda$ satisfies the (non-approximate) probabilistic Condorcet criterion, while $CM^{LAP}_\lambda$ and $CM^{EXP}_\lambda$ satisfy lexi-strategyproofness. Finally, we regard differential privacy as a voting axiom and discuss its relations to other axioms.
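To make the Laplacian variant concrete, here is a minimal numpy sketch of a Laplace-noised pairwise-comparison rule in the spirit of $CM^{LAP}_\lambda$; the noise placement, winner-selection rule, and tie-breaking are assumptions for illustration, not the paper's exact mechanism:

```python
# Sketch of a Laplace-noised Condorcet-style rule. Assumptions: noise is
# added to pairwise margins and the winner is the candidate with the most
# noisy pairwise wins; the paper's mechanism may place noise differently.
import numpy as np

def noisy_condorcet_winner(rankings: np.ndarray, lam: float, rng) -> int:
    """rankings[v, c] = position of candidate c in voter v's ranking
    (smaller = more preferred)."""
    m = rankings.shape[1]
    wins = np.zeros(m)
    for a in range(m):
        for b in range(a + 1, m):
            margin = (np.sum(rankings[:, a] < rankings[:, b]) -
                      np.sum(rankings[:, b] < rankings[:, a]))
            margin += rng.laplace(scale=1.0 / lam)   # privacy noise
            wins[a if margin > 0 else b] += 1
    return int(np.argmax(wins))  # candidate with most noisy pairwise wins

rng = np.random.default_rng(42)
rankings = np.argsort(rng.random((100, 4)), axis=1)  # 100 voters, 4 candidates
print("winner:", noisy_condorcet_winner(rankings, lam=0.5, rng=rng))
```

In this sketch the Laplace scale is $1/\lambda$, so larger $\lambda$ means less noise and higher accuracy at the cost of privacy; the paper's exact parameterization may differ.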
Utilizing the 6DoF (degrees of freedom) pose information of objects and their components is crucial for object state detection tasks. We present the IKEA Object State Dataset, which contains IKEA furniture 3D models, RGB-D videos of the assembly process, 6DoF poses of furniture parts, and their bounding boxes. The proposed dataset will be available at https://github.com/mxllmx/IKEAObjectStateDataset.
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point-cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive: CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even when the LiDAR input is missing. Code will be released at https://github.com/junjie18/CMT.
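The phrase "implicit spatial alignment" can be read as: tag tokens from both modalities with encodings of the same 3D coordinates, so a plain transformer can relate them without projecting one view into the other. A minimal PyTorch sketch of that idea; the toy dimensions, random coordinates, and MLP position encoder are assumptions rather than CMT's actual components:

```python
# Sketch of implicit cross-modal alignment via shared 3D position encoding.
# Assumptions: random toy features/coordinates and a small MLP encoder; the
# real model uses learned backbones and different heads.
import torch
import torch.nn as nn

d = 128
pos_enc = nn.Sequential(nn.Linear(3, d), nn.ReLU(), nn.Linear(d, d))

# Image tokens tagged with 3D coordinates sampled along their camera rays,
# and point-cloud tokens tagged with their own 3D coordinates: both
# modalities then live in one shared, implicitly aligned token space.
img_tokens = torch.randn(2, 100, d) + pos_enc(torch.rand(2, 100, 3))
pts_tokens = torch.randn(2, 300, d) + pos_enc(torch.rand(2, 300, 3))
tokens = torch.cat([img_tokens, pts_tokens], dim=1)      # (B, 400, d)

queries = nn.Parameter(torch.zeros(1, 20, d))            # object queries
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d, nhead=8, batch_first=True), num_layers=2)
boxes = nn.Linear(d, 7)(decoder(queries.expand(2, -1, -1), tokens))
print(boxes.shape)   # torch.Size([2, 20, 7]) -> 7-DoF 3D box per query
```

Because alignment lives in the position encodings rather than in a projection step, dropping the point-cloud tokens still leaves a valid token sequence, which is one intuition for the robustness to missing LiDAR.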
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
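One concrete way to read "the encoded style code adjusts the weights of the feed-forward layers" is multiplicative modulation: the style code predicts per-channel gains that rescale the FFN's hidden units. A minimal PyTorch sketch under that assumption; the paper's actual adaptation mechanism may differ:

```python
# Sketch of a style-aware adaptive feed-forward layer: a style code rescales
# the FFN's hidden activations so one decoder can render many speaking
# styles. Assumption: multiplicative channel modulation is one simple way to
# let a code "adjust the weights"; the paper's exact design may differ.
import torch
import torch.nn as nn

class StyleAdaptiveFFN(nn.Module):
    def __init__(self, d=256, hidden=1024, style_dim=128):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(d, hidden), nn.Linear(hidden, d)
        self.to_scale = nn.Linear(style_dim, hidden)  # style -> channel gains

    def forward(self, x, style_code):
        h = torch.relu(self.fc1(x))
        # Per-channel gains predicted from the style code effectively
        # reweight fc1's output rows for this particular speaking style.
        h = h * (1 + self.to_scale(style_code)).unsqueeze(1)
        return self.fc2(h)

ffn = StyleAdaptiveFFN()
content = torch.randn(2, 50, 256)     # audio-derived content features
style = torch.randn(2, 128)           # style code from a reference video
print(ffn(content, style).shape)      # torch.Size([2, 50, 256])
```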
The visual dimension of cities has been a fundamental subject in urban studies, since the pioneering work of scholars such as Sitte, Lynch, Arnheim, and Jacobs. Several decades later, big data and artificial intelligence (AI) are revolutionizing how people move, sense, and interact with cities. This paper reviews the literature on the appearance and function of cities to illustrate how visual information has been used to understand them. A conceptual framework, Urban Visual Intelligence, is introduced to systematically elaborate on how new image data sources and AI techniques are reshaping the way researchers perceive and measure cities, enabling the study of the physical environment and its interactions with socioeconomic environments at various scales. The paper argues that these new approaches enable researchers to revisit the classic urban theories and themes, and potentially help cities create environments that are more in line with human behaviors and aspirations in the digital age.
Deploying reliable deep learning techniques in interdisciplinary applications needs learned models to output accurate and (even more importantly) explainable predictions. Existing approaches typically explicate network outputs in a post-hoc fashion, under the implicit assumption that faithful explanations come from accurate predictions/classifications. We make the opposite claim: explanations boost (or even determine) classification. That is, end-to-end learning of explanation factors to augment discriminative representation extraction can be a more intuitive strategy to inversely assure fine-grained explainability, e.g., in neuroimaging and neuroscience studies with high-dimensional data containing noisy, redundant, and task-irrelevant information. In this paper, we propose such an explainable geometric deep network, dubbed NeuroExplainer, and apply it to uncover altered infant cortical development patterns associated with preterm birth. Given fundamental cortical attributes as network input, NeuroExplainer adopts a hierarchical attention-decoding framework to learn fine-grained attentions and the respective discriminative representations to accurately recognize preterm infants from term-born infants at term-equivalent age. NeuroExplainer learns the hierarchical attention-decoding modules under subject-level weak supervision, coupled with targeted regularizers deduced from domain knowledge regarding brain development. These prior-guided constraints implicitly maximize the explainability metrics (i.e., fidelity, sparsity, and stability) during network training, driving the learned network to output detailed explanations and accurate classifications. Experimental results on the public dHCP benchmark suggest that NeuroExplainer yields quantitatively reliable explanations that are qualitatively consistent with representative neuroimaging studies.
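Of the three metrics, sparsity and stability translate most directly into training-time penalties on the attention maps. A minimal sketch under stated assumptions: an entropy term for sparsity and a perturbation-consistency term for stability are generic choices for illustration, not the paper's exact regularizers:

```python
# Sketch of explainability regularizers on attention maps. Assumptions:
# entropy encourages sparse attention and a consistency term encourages
# stable attention; the paper derives its constraints from domain priors.
import torch

def sparsity_loss(attn: torch.Tensor) -> torch.Tensor:
    """Low entropy => attention mass concentrates on few cortical regions."""
    p = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)
    return -(p * (p + 1e-8).log()).sum(dim=-1).mean()

def stability_loss(attn: torch.Tensor, attn_perturbed: torch.Tensor) -> torch.Tensor:
    """Explanations should barely change under small input perturbations."""
    return (attn - attn_perturbed).pow(2).mean()

attn = torch.rand(4, 1000).softmax(dim=-1)              # per-vertex attention
attn_jit = (attn + 0.01 * torch.randn_like(attn)).clamp(min=0)
total = sparsity_loss(attn) + 0.5 * stability_loss(attn, attn_jit)
print(f"regularizer: {total.item():.4f}")
```

Adding such terms to the classification objective is what lets the explanation quality be optimized during training rather than assessed only post hoc.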